This paper presents ReasonFormer, a unified reasoning framework that mirrors the modular and compositional reasoning process of humans in complex decision-making. Inspired by dual-process theory in cognitive science, the representation module (automatic thinking) and reasoning modules (controlled thinking) are decoupled to capture different levels of cognition. On top of the representation module, the pre-trained reasoning modules are modular, each specializing in a specific, fundamental reasoning skill (e.g., logic, simple QA). To mimic the controlled, compositional thinking process, different reasoning modules are dynamically activated and composed in both parallel and cascaded manners, controlling which reasoning skills are activated and how deep the reasoning process goes for the problem at hand. The unified reasoning framework solves multiple tasks with a single model and is trained and performs inference end-to-end. Evaluated on 11 datasets requiring different reasoning skills and complexity, ReasonFormer demonstrates substantial performance boosts, revealing its compositional reasoning ability. Few-shot experiments show better generalization: the model learns to compose pre-trained skills for new tasks with limited data, benefiting from the decoupling of the representation module and the reasoning modules. Further analysis confirms the modularity of the reasoning modules, as different tasks activate distinct reasoning skills at different reasoning depths.
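The abstract does not specify the routing mechanism, so the following is only a minimal sketch of what "dynamically activated and composed in parallel and cascaded manners" could look like; the module and router names are assumptions, not ReasonFormer's actual architecture.

```python
import torch
import torch.nn as nn

class ReasoningModule(nn.Module):
    """One pre-trained skill module (e.g., logic, simple QA); a placeholder MLP here."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, h):
        return self.ff(h)

class ComposedReasoner(nn.Module):
    """Toy composition: at each cascaded step, a router softly mixes the outputs
    of all skill modules applied in parallel to the current hidden state."""
    def __init__(self, dim, num_skills=4, max_depth=3):
        super().__init__()
        self.skills = nn.ModuleList(ReasoningModule(dim) for _ in range(num_skills))
        self.router = nn.Linear(dim, num_skills)   # decides which skills are activated
        self.max_depth = max_depth

    def forward(self, h):                           # h: [batch, dim] from the representation module
        for _ in range(self.max_depth):             # cascaded (depth) composition
            weights = torch.softmax(self.router(h), dim=-1)        # [batch, num_skills]
            outs = torch.stack([m(h) for m in self.skills], dim=1) # parallel composition
            h = h + (weights.unsqueeze(-1) * outs).sum(dim=1)      # residual update
        return h

reasoner = ComposedReasoner(dim=64)
print(reasoner(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```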
Compared with unimodal data, multimodal data provide more features that help a model analyze the sentiment of the data. Previous works rarely consider token-level feature fusion, and few explore learning sentiment-related common features in multimodal data to help the model fuse multimodal features. In this paper, we propose a Contrastive Learning and Multi-Layer Fusion (CLMLF) method for multimodal sentiment detection. Specifically, we first encode text and images to obtain hidden representations, and then use a multi-layer fusion module to align and fuse the token-level features of text and images. In addition to the sentiment analysis task, we design two contrastive learning tasks, label-based contrastive learning and data-based contrastive learning, which help the model learn sentiment-related common features in multimodal data. Extensive experiments on three publicly available multimodal datasets demonstrate the effectiveness of our approach for multimodal sentiment detection compared with existing methods. The code is available at https://github.com/link-li/clmlf.
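As a concrete illustration of the label-based contrastive objective mentioned above, here is a generic supervised contrastive loss over fused text-image features; samples sharing a sentiment label are pulled together and others pushed apart. This is a standard formulation assumed for illustration, not necessarily CLMLF's exact loss.

```python
import torch
import torch.nn.functional as F

def label_based_contrastive_loss(features, labels, temperature=0.07):
    """Supervised (label-based) contrastive loss over fused multimodal features.
    features: [batch, dim], labels: [batch] sentiment ids."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature                          # pairwise similarities
    batch = labels.size(0)
    self_mask = torch.eye(batch, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # log-probability of each pair, excluding self-similarity from the denominator
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, -1e9), dim=1, keepdim=True)
    num_pos = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=1).div(num_pos).mean()

loss = label_based_contrastive_loss(torch.randn(8, 128), torch.randint(0, 3, (8,)))
print(loss.item())
```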
Responsive listening during face-to-face conversations is a key element of social interaction and is well established in psychological research. By responding to the speaker's words, tone, or behavior in real time with non-verbal signals, listeners show how engaged they are in the conversation. In this work, we build the Responsive Listener Dataset (RLD), a conversational video corpus collected from public resources, which includes 67 speakers and 76 listeners with three different attitudes. We define the responsive listening head generation task as the synthesis of non-verbal head motions and expressions conditioned on the speaker's audio and visual signals. Unlike speech-driven gesture or talking-head generation, we introduce more modalities in this task, hoping to benefit several research fields, including human-to-human interaction, video-to-video translation, cross-modal understanding, and generation. In addition, we release an attitude-conditioned listening head generation baseline. Project page: \url{https://project.mhzhou.com/rld}.
As a bio-inspired sensor with high temporal resolution, the spike camera shows great potential in real-world applications, especially for motion estimation in high-speed scenes. However, frame-based and event-based methods are not well suited to spike streams from the spike camera due to the different data modality. To this end, we present SCFlow, a tailored deep learning pipeline to estimate optical flow in high-speed scenes from spike streams. Importantly, a new input representation is introduced, which can adaptively remove motion blur from spike streams based on the prior motion. Further, for training SCFlow, we synthesize two sets of optical flow data for the spike camera, Spiking Flying Things and Photo-realistic High-speed Motion, denoted as SPIFT and PHM respectively, corresponding to random high-speed scenes and carefully designed scenes. Experimental results show that SCFlow can predict optical flow from spike streams in different high-speed scenes. Moreover, SCFlow shows promising generalization to real spike streams. All code and constructed datasets will be released after publication.
Multimodal machine translation (MMT) improves translation quality by introducing visual information. However, existing MMT models ignore the problem that an image may carry information irrelevant to the text, which introduces considerable noise and harms translation quality. This paper proposes a novel Gumbel attention mechanism for multimodal machine translation that selects the text-related parts of the image features. Specifically, unlike previous attention-based methods, we first use a differentiable method to select image information and automatically discard the useless parts of the image features. Experiments demonstrate that our method retains the image features related to the text, and these retained features help the MMT model generate better translations.
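To make the "differentiable selection" idea concrete, here is a small sketch of a text-conditioned gate that uses the Gumbel-Softmax trick to make a discrete yet differentiable keep/drop decision per image region. The module and parameter names are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelImageSelector(nn.Module):
    """Text-conditioned keep/drop gating of image regions via Gumbel-Softmax."""
    def __init__(self, text_dim, img_dim):
        super().__init__()
        self.score = nn.Linear(text_dim + img_dim, 2)   # logits for [drop, keep]

    def forward(self, text_repr, img_feats, tau=1.0):
        # text_repr: [batch, text_dim], img_feats: [batch, regions, img_dim]
        t = text_repr.unsqueeze(1).expand(-1, img_feats.size(1), -1)
        logits = self.score(torch.cat([t, img_feats], dim=-1))       # [batch, regions, 2]
        gate = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1:] # hard one-hot, index 1 = keep
        return img_feats * gate                                      # text-irrelevant regions zeroed out

sel = GumbelImageSelector(text_dim=32, img_dim=64)
out = sel(torch.randn(2, 32), torch.randn(2, 49, 64))
print(out.shape)  # torch.Size([2, 49, 64])
```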
In-context learning, as a new paradigm in NLP, allows a model to rapidly adapt to various tasks with only a handful of prompts and examples. In computer vision, however, the difficulty of in-context learning lies in the fact that tasks vary significantly in their output representations, so it is unclear how to define general-purpose task prompts that a vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model that addresses these obstacles with an "image"-centric solution: redefine the output of core vision tasks as images, and specify task prompts as images as well. With this idea, our training process is extremely simple: we perform standard masked image modeling on the stitch of input and output image pairs. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition to indicate which task to perform. Without bells and whistles, our generalist Painter achieves competitive performance compared to well-established task-specific models on seven representative vision tasks ranging from high-level visual understanding to low-level image processing, and significantly outperforms recent generalist models on several challenging tasks. Surprisingly, our model shows the capability of completing out-of-domain tasks that do not exist in the training data, such as open-category keypoint detection and object segmentation, validating the powerful task transferability of in-context learning.
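The following is a minimal sketch of how a training example could be built for masked image modeling on stitched input/output pairs, as described above. It reflects my reading of the abstract, not Painter's official code; patch size, mask ratio, and the vertical stitching direction are assumptions.

```python
import torch

def build_painter_example(x_img, y_img, patch=16, mask_ratio=0.75):
    """Stitch a task input image and its output image into one canvas, then
    randomly mask patches so the model must reconstruct them from visible ones.
    x_img, y_img: [3, H, W] tensors of the same size."""
    canvas = torch.cat([x_img, y_img], dim=1)            # stitch vertically: [3, 2H, W]
    _, H2, W = canvas.shape
    gh, gw = H2 // patch, W // patch
    num_patches = gh * gw
    num_masked = int(mask_ratio * num_patches)
    mask_flat = torch.zeros(num_patches, dtype=torch.bool)
    mask_flat[torch.randperm(num_patches)[:num_masked]] = True
    mask = mask_flat.view(gh, gw)
    pixel_mask = mask.repeat_interleave(patch, 0).repeat_interleave(patch, 1)  # [2H, W]
    masked_canvas = canvas * (~pixel_mask)               # masked patches zeroed out
    return masked_canvas, canvas, pixel_mask             # model input, regression target, loss mask

x, y = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
inp, target, m = build_painter_example(x, y)
print(inp.shape, m.float().mean())                       # torch.Size([3, 448, 224]), ~0.75
```

At inference, the same canvas layout would carry a prompt pair plus the query image, with the unknown output region masked and predicted by the model.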
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out, image-text-aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform the trained-from-scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
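A sketch of the pretext objective described above: regress frozen, image-text-aligned teacher features (e.g., from a CLIP image encoder) at the masked token positions. The negative-cosine loss and the teacher choice here are assumptions for illustration; EVA's exact loss and teacher may differ.

```python
import torch
import torch.nn.functional as F

def masked_feature_reconstruction_loss(student_pred, teacher_feats, mask):
    """Masked feature regression against a frozen teacher.
    student_pred, teacher_feats: [batch, tokens, dim]; mask: [batch, tokens] bool (True = masked)."""
    pred = F.normalize(student_pred, dim=-1)
    target = F.normalize(teacher_feats.detach(), dim=-1)   # teacher is not updated
    cos = (pred * target).sum(dim=-1)                      # per-token cosine similarity
    return -(cos * mask).sum() / mask.sum().clamp(min=1)   # average over masked tokens only

loss = masked_feature_reconstruction_loss(
    torch.randn(2, 196, 768), torch.randn(2, 196, 768), torch.rand(2, 196) > 0.6)
print(loss.item())
```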
Communication can help agents obtain information about others so that better coordinated behaviors can be learned. Some existing works communicate predicted future trajectories with others, hoping to provide clues about what others are going to do and thereby achieve better coordination. However, when agents act synchronously, circular dependencies can arise, making it hard to coordinate decision-making. In this paper, we propose a novel communication scheme, Sequential Communication (SeqComm). SeqComm treats agents asynchronously (upper-level agents make decisions before lower-level ones) and has two communication phases. In the negotiation phase, agents determine the priority of decision-making by communicating the hidden states of their observations and comparing the values of their intentions, which are obtained by modeling the environment dynamics. In the launching phase, upper-level agents take the lead in making decisions and communicate their actions to lower-level agents. Theoretically, we prove that the policies learned by SeqComm are guaranteed to improve monotonically and converge. Empirically, we show that SeqComm outperforms existing methods in a variety of multi-agent cooperative tasks.
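A toy sketch of one such sequential round, heavily simplified from the description above (the intention values are taken as given rather than computed from a learned world model, and all names are hypothetical): agents are ordered by intention value in the negotiation phase, then act in that order, each conditioning on the actions already communicated by higher-priority agents.

```python
import torch

def sequential_decision_round(obs_hidden, intention_values, policies):
    """One simplified SeqComm-style round.
    obs_hidden: [n_agents, dim]; intention_values: [n_agents]; policies: list of callables."""
    order = torch.argsort(intention_values, descending=True)   # negotiation phase: priority
    messages, actions = [], {}
    for i in order.tolist():                                   # launching phase: act in order
        upper_actions = torch.tensor(messages, dtype=torch.float) if messages else torch.zeros(0)
        actions[i] = policies[i](obs_hidden[i], upper_actions)  # condition on upper-level actions
        messages.append(actions[i])                             # communicate to lower-level agents
    return actions

# dummy policies: pick an action id from the observation plus the received messages
policies = [lambda h, msgs: int((h.sum() + msgs.sum()).item()) % 5 for _ in range(3)]
acts = sequential_decision_round(torch.randn(3, 8), torch.randn(3), policies)
print(acts)
```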
Depth estimation is essential for various important real-world applications such as autonomous driving. However, it suffers severe performance degradation in high-speed scenarios because conventional cameras can only capture blurred images. To address this problem, the spike camera is designed to capture pixel-wise luminance intensity at a high frame rate. However, depth estimation with a spike camera remains very challenging when using traditional monocular or stereo depth estimation algorithms, which are based on photometric consistency. In this paper, we propose a novel Uncertainty-Guided Depth Fusion (UGDF) framework to fuse the predictions of monocular and stereo depth estimation networks for spike cameras. Our framework is motivated by the fact that stereo spike depth estimation achieves better results at close range, while monocular spike depth estimation achieves better results at long range. We therefore introduce a dual-task depth estimation architecture with a joint training strategy and estimate the distributed uncertainty to fuse the monocular and stereo results. To demonstrate the advantage of spike depth estimation over conventional camera depth estimation, we contribute a spike-depth dataset named CitySpike20K, which contains 20K paired samples, for spike depth estimation. UGDF achieves state-of-the-art results on CitySpike20K, surpassing all monocular and stereo spike depth estimation baselines. We conduct extensive experiments to evaluate the effectiveness and generalization of our method on CitySpike20K. To our knowledge, our framework is the first dual-task fusion framework for spike-camera depth estimation. Code and dataset will be released.
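For intuition, here is a generic inverse-uncertainty fusion of two depth maps: each pixel is weighted by the inverse of its predicted uncertainty, so the more confident branch dominates. This is a common fusion choice assumed for illustration; UGDF's exact fusion rule may differ.

```python
import torch

def uncertainty_guided_fusion(depth_mono, depth_stereo, sigma_mono, sigma_stereo, eps=1e-6):
    """Per-pixel inverse-uncertainty weighting of monocular and stereo depth.
    All inputs: [batch, 1, H, W]; sigma_* are predicted per-pixel uncertainties (> 0)."""
    w_mono = 1.0 / (sigma_mono + eps)
    w_stereo = 1.0 / (sigma_stereo + eps)
    return (w_mono * depth_mono + w_stereo * depth_stereo) / (w_mono + w_stereo)

fused = uncertainty_guided_fusion(torch.rand(1, 1, 4, 4) * 80, torch.rand(1, 1, 4, 4) * 80,
                                  torch.rand(1, 1, 4, 4), torch.rand(1, 1, 4, 4))
print(fused.shape)  # torch.Size([1, 1, 4, 4])
```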
Neuromorphic spike cameras generate data streams with high temporal resolution in a bio-inspired way, showing great potential in real-world applications such as autonomous driving. In contrast to RGB streams, spike streams have an inherent advantage in overcoming motion blur, leading to more accurate depth estimation for high-speed objects. However, it is almost impossible to train a spike depth estimation network in a supervised manner, because obtaining paired depth labels for temporally dense spike streams is extremely laborious and challenging. In this paper, instead of building a spike-stream dataset with full depth labels, we transfer knowledge from open-source RGB datasets (e.g., KITTI) in an unsupervised manner and estimate spike depth. The key challenges of this problem lie in the modality gap between the RGB and spike modalities, and the domain gap between the labeled source RGB domain and the unlabeled target spike domain. To overcome these challenges, we introduce a cross-modality cross-domain (BiCross) framework for unsupervised spike depth estimation. Our method narrows the large gap between source RGB and target spike by introducing an intermediate simulated source spike domain. Specifically, for the cross-modality phase, we propose a novel coarse-to-fine knowledge distillation (CFKD) scheme that transfers image-level and pixel-level knowledge from source RGB to source spike. This design leverages the abundant semantic information of the RGB modality and the dense temporal information of the spike modality, respectively. For the cross-domain phase, we introduce an uncertainty-guided mean teacher (UGMT) to generate reliable pseudo labels with uncertainty estimation, alleviating the shift between the source spike and target spike domains. In addition, we propose a global-level feature alignment method (GLFA) to align the features of the two domains and generate more reliable pseudo labels.
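As a rough illustration of the uncertainty-guided mean-teacher idea, the sketch below combines the standard EMA teacher update with Monte-Carlo-dropout pseudo-label filtering: only pixels whose prediction variance falls below a threshold are kept as supervision. The thresholding scheme, network, and hyperparameters are assumptions, not BiCross's exact implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Mean-teacher EMA weight update (standard recipe)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def uncertainty_filtered_pseudo_labels(teacher, spike_input, num_passes=4, max_var=0.05):
    """Run the teacher several times with dropout active; keep only pixels whose
    prediction variance is below a threshold as reliable pseudo labels."""
    teacher.train()                                  # keep dropout on for MC sampling
    with torch.no_grad():
        preds = torch.stack([teacher(spike_input) for _ in range(num_passes)])
    mean, var = preds.mean(dim=0), preds.var(dim=0)
    return mean, var < max_var                       # pseudo depth, reliability mask

student = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.Dropout2d(0.2), nn.Conv2d(8, 1, 3, padding=1))
teacher = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.Dropout2d(0.2), nn.Conv2d(8, 1, 3, padding=1))
teacher.load_state_dict(student.state_dict())
ema_update(teacher, student)
pseudo, reliable = uncertainty_filtered_pseudo_labels(teacher, torch.randn(2, 1, 32, 32))
print(pseudo.shape, reliable.float().mean().item())
```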